Introduction to REGEX

What is REGEX?

1) Short for regular expression, a regex is a string of text that lets you create patterns that help match, locate, and manage text.
2) Perl is a well-known example of a programming language with built-in support for regular expressions.
3) Database management systems (DBMS) use queries to pull information out of a database.
4) However, Perl is only one of the many places you can find regular expressions. Regular expressions can also be used from the command line and in text editors to find text within a file.
5) Regular expressions are useful in search and replace operations.
6) The typical use case is to look for a substring that matches a pattern and replace it with something else. Most APIs that use regular expressions let you reference capture groups from the search pattern in the replacement string.
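As an illustration of referencing capture groups in a replacement string, here is a minimal sketch using Python's standard re module. The sample text and the date format are assumptions made up for this example.

```python
import re

# Hypothetical sample text; dates are written as YYYY-MM-DD.
text = "Released on 2023-04-15, patched on 2023-05-02."

# Three capture groups: year, month, day.
pattern = r"(\d{4})-(\d{2})-(\d{2})"

# \3, \2 and \1 in the replacement refer back to the captured groups.
result = re.sub(pattern, r"\3/\2/\1", text)
print(result)  # Released on 15/04/2023, patched on 02/05/2023.
```

Other languages spell the group reference differently (for example $1 instead of \1), so check the documentation of the API you are using.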

How to write a regular expression?

1) Repeaters : * , + and { }:

  • These symbols act as repeaters and tell the regex engine how many times the preceding character (or group) may be repeated.

  • 2) The asterisk symbol ( * ):

  • It tells the computer to match the preceding character (or set of characters) zero or more times (with no upper limit).

  • 3) The Plus symbol ( + ):

  • It tells the computer to match the preceding character (or set of characters) one or more times (with no upper limit).

  • 4) The curly braces {…}:

  • It tells the computer to repeat the preceding character (or set of characters) exactly as many times as the value inside the braces, e.g. {3}. A range such as {2,4} allows any count from the first value to the second.
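The three repeaters can be compared side by side. This is a small sketch in Python's re module; the test strings are invented for illustration, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

candidates = ["ac", "abc", "abbc", "abbbc"]

# * : the preceding 'b' may appear zero or more times
star = [bool(re.fullmatch(r"ab*c", s)) for s in candidates]

# + : the preceding 'b' must appear at least once
plus = [bool(re.fullmatch(r"ab+c", s)) for s in candidates]

# {2} : the preceding 'b' must appear exactly twice
exact = [bool(re.fullmatch(r"ab{2}c", s)) for s in candidates]

# {1,2} : the preceding 'b' must appear once or twice
ranged = [bool(re.fullmatch(r"ab{1,2}c", s)) for s in candidates]

print(star)    # [True, True, True, True]
print(plus)    # [False, True, True, True]
print(exact)   # [False, False, True, False]
print(ranged)  # [False, True, True, False]
```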

  • 5) Wildcard - ( . ):

  • The dot symbol can take the place of any single character (in most engines, any character except a newline by default), which is why it is called the wildcard character.

  • 6) Optional character - ( ? ):

  • This symbol tells the computer that the preceding character (or group) may or may not be present in the string to be matched.
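A short sketch of the wildcard and the optional character in Python's re module follows. The words used are arbitrary examples; the re.DOTALL flag is Python-specific, and other engines use a different switch for the same behaviour.

```python
import re

# . matches any single character except a newline (Python's default)
dot = [bool(re.fullmatch(r"c.t", s)) for s in ["cat", "c9t", "ct", "c\nt"]]
print(dot)  # [True, True, False, False]

# With re.DOTALL the dot also matches a newline
dot_all = bool(re.fullmatch(r"c.t", "c\nt", flags=re.DOTALL))
print(dot_all)  # True

# ? makes the preceding 'u' optional
optional = [bool(re.fullmatch(r"colou?r", s)) for s in ["color", "colour", "colouur"]]
print(optional)  # [True, True, False]
```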

  • 7) The caret ( ^ ) symbol:

  • It sets the position of the match: it tells the computer that the match must start at the beginning of the string or line.

  • 8) The dollar ( $ ) symbol:

  • It tells the computer that the match must occur at the end of the string or before \n at the end of the line or string.
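The effect of the two anchors can be sketched in Python as below. The sample text is made up; whether ^ and $ also apply at each line break depends on a flag (re.MULTILINE in Python), which is an engine-specific detail.

```python
import re

text = "cat hat\ncat"  # hypothetical two-line input

# Without MULTILINE, ^ matches only at the very start of the string
start_default = re.findall(r"^cat", text)
print(start_default)  # ['cat']

# With MULTILINE, ^ also matches at the start of each line
start_multi = re.findall(r"^cat", text, flags=re.MULTILINE)
print(start_multi)  # ['cat', 'cat']

# 'hat' sits at the end of line 1, not at the end of the string
end_default = re.search(r"hat$", text) is not None
end_multi = re.search(r"hat$", text, flags=re.MULTILINE) is not None
print(end_default, end_multi)  # False True

# $ also matches just before a newline that ends the string
before_newline = re.search(r"cat$", "black cat\n") is not None
print(before_newline)  # True
```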

  • Components of REGEX :

    1) A character class : matches any one of a set of characters. It is used to match the most basic elements of a language, such as a letter, a digit, a space, or a symbol.
    \s : matches any whitespace character, such as a space or tab
    \S : matches any non-whitespace character
    \d : matches any digit character
    \D : matches any non-digit character
    \w : matches any word character (a letter, a digit, or an underscore)
    \W : matches any non-word character
    \b : matches a word boundary, i.e. the zero-width position between a word character and a non-word character (such as a space, dash, comma, or semicolon), or the start or end of the string
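The shorthand classes are written with a backslash (\s, \d, \w, \b and their uppercase negations). A minimal Python sketch with an invented sample string shows what each one picks out:

```python
import re

sample = "Room 42, floor 7"  # hypothetical sample text

digits = re.findall(r"\d+", sample)      # runs of digit characters
words = re.findall(r"\w+", sample)       # runs of word characters
spaces = re.findall(r"\s", sample)       # individual whitespace characters
non_word = re.findall(r"\W", sample)     # individual non-word characters
digits_only = re.sub(r"\D", "", sample)  # delete every non-digit

print(digits)       # ['42', '7']
print(words)        # ['Room', '42', 'floor', '7']
print(spaces)       # [' ', ' ', ' ']
print(non_word)     # [' ', ',', ' ', ' ']
print(digits_only)  # 427

# \b is zero-width: it consumes no characters, it only checks the position
whole_words = re.findall(r"\bcat\b", "cat concat cat.")
print(whole_words)  # ['cat', 'cat']  ('concat' is skipped)
```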

  • The Escape Symbol : \
    If you want to match the actual '+', '.' etc. characters, add a backslash ( \ ) before that character. This tells the computer to treat the following character as a literal character to search for, rather than as a special symbol.
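A small Python sketch of escaping follows. The sample text is invented; re.escape is a helper from Python's standard library that adds the backslashes for you, and other languages may or may not offer an equivalent.

```python
import re

text = "1+1=2 and 1.5 is not 115"  # hypothetical sample text

# Unescaped dot is a wildcard, so '115' matches as well
unescaped = re.findall(r"1.5", text)
print(unescaped)  # ['1.5', '115']

# Escaped dot matches only a literal '.'
escaped_dot = re.findall(r"1\.5", text)
print(escaped_dot)  # ['1.5']

# Escaped plus matches a literal '+' instead of acting as a repeater
escaped_plus = re.findall(r"1\+1", text)
print(escaped_plus)  # ['1+1']

# re.escape builds the escaped pattern from a literal string
auto_escaped = re.findall(re.escape("1.5"), text)
print(auto_escaped)  # ['1.5']
```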
  • Grouping Characters ( )
    Several symbols of a regular expression can be grouped together so that they act as a single unit. To do this, wrap that part of the regular expression in parentheses ( ).
  • Vertical Bar ( | ) :
    Matches any one of the alternatives separated by the vertical bar (|) character.
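Grouping and the vertical bar are often used together. The following Python sketch uses arbitrary example words to show how parentheses change what a repeater or an alternation applies to.

```python
import re

# Without a group, + repeats only the preceding 'a'
ungrouped = re.fullmatch(r"ha+", "hahaha") is not None
# With a group, + repeats the whole unit 'ha'
grouped = re.fullmatch(r"(ha)+", "hahaha") is not None
print(ungrouped, grouped)  # False True

# | matches any one of the alternatives
animals = re.findall(r"cat|dog", "cat, dog, bird")
print(animals)  # ['cat', 'dog']

# Parentheses limit how far the alternation reaches
scoped = re.fullmatch(r"gr(a|e)y", "grey") is not None   # 'gray' or 'grey'
unscoped = re.fullmatch(r"gra|ey", "grey") is not None   # 'gra' or 'ey'
print(scoped, unscoped)  # True False
```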
  • \number :
    Backreference: allows a previously matched sub-expression (an expression captured, i.e. enclosed within parentheses) to be referred to later in the same regular expression. \N, where N is a group number, means that the text matched by the N-th group must appear again at the current position.
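A common use of a backreference is finding a word that is accidentally repeated. This is a minimal Python sketch with a made-up sentence; the pattern is one possible way to express the idea, not the only one.

```python
import re

text = "this is is a test test case"  # hypothetical input with doubled words

# (\w+) captures a word; \1 requires the same text to appear again
doubled = re.findall(r"\b(\w+) \1\b", text)
print(doubled)  # ['is', 'test']

# The same group can be referenced in the replacement to drop the duplicate
cleaned = re.sub(r"\b(\w+) \1\b", r"\1", text)
print(cleaned)  # this is a test case
```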
  • Comment : (?# comment) -
    Inline comment: The comment ends at the first closing parenthesis.
    2) Groups
    3) Quantifiers
    4) Backreferences
    5) Anchors, Boundaries, Delimiters
    6) Lookarounds
    7) Modifiers

  • Performance Pitfalls :

  • You should know a little about how your regex engine works, since two "equivalent" regexes can differ drastically in processing speed.
    ◉ It is possible to write regexes that take exponential time to match, but you pretty much have to try to make one (they are pathological).
    ◉ It is more common to accidentally create regexes that run in quadratic time.
  • Main types of problems
    ◉ Recompilation (from forgetting to compile once a regex that is used multiple times)
    ◉ Dot-star in the middle of a pattern (which causes backtracking)
      - Solution 1: Use a negated character class
      - Solution 2: Use reluctant (lazy) quantifiers
    ◉ Nested Repetition
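The pitfalls above can be sketched in Python as follows. The input line and the pattern names are hypothetical. Note that Python's re module caches recently used compiled patterns, so the recompilation cost is smaller there than in some other languages; compiling once is still the clearer habit.

```python
import re

# Compile once and reuse the pattern objects instead of rebuilding them per call
QUOTED_GREEDY = re.compile(r'".*"')      # dot-star in the middle of the pattern
QUOTED_NEGATED = re.compile(r'"[^"]*"')  # Solution 1: negated character class
QUOTED_LAZY = re.compile(r'".*?"')       # Solution 2: reluctant (lazy) quantifier

line = 'name="alpha" id="beta"'  # hypothetical input

# Greedy .* runs to the end of the line and backtracks to the last quote
greedy = QUOTED_GREEDY.findall(line)
print(greedy)   # ['"alpha" id="beta"']

# Both fixes stop at the first closing quote
negated = QUOTED_NEGATED.findall(line)
lazy = QUOTED_LAZY.findall(line)
print(negated)  # ['"alpha"', '"beta"']
print(lazy)     # ['"alpha"', '"beta"']

# Nested repetition: (?:a+)+b accepts the same strings as a+b, but the nested
# form can backtrack heavily on long inputs that fail to match (not run here).
nested = re.compile(r"(?:a+)+b")
flat = re.compile(r"a+b")
same = all(
    bool(nested.fullmatch(s)) == bool(flat.fullmatch(s))
    for s in ["ab", "aaab", "b", "aa"]
)
print(same)  # True
```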
  • Performance Tips :

  • Use non-capturing groups when you need parentheses but do not need the captured text.
  • If the regex is very complex, do a quick spot-check before attempting a match, e.g.
    ◉ Does an email address contain '@'?
  • Present the most likely alternative first, e.g.
    ◉ black|white|blue|red|green|metallic seaweed
  • Reduce the amount of looping the engine has to do
    ◉ In some engines, \d\d\d\d\d is faster than \d{5}
    ◉ Likewise, aaaa+ can be faster than a{4,}
  • Avoid obvious backtracking, e.g.
    ◉ Mr|Ms|Mrs should be M(?:rs?|s)
    ◉ Good morning|Good evening should be Good (?:morning|evening)
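A few of these tips can be sketched in Python. The email pattern below is a deliberately simplified, hypothetical one used only to illustrate the spot-check idea, and looks_like_email is a helper name invented for this example; real email validation is more involved.

```python
import re

# Non-capturing group (?:...) groups without storing the matched text
TITLE = re.compile(r"\bM(?:rs?|s)\b")
titles = TITLE.findall("Mr Smith, Mrs Jones and Ms Lee")
print(titles)  # ['Mr', 'Mrs', 'Ms']

# With a capturing group, findall returns only the captured part
captured = re.findall(r"\bM(rs?|s)\b", "Mr Smith, Mrs Jones and Ms Lee")
print(captured)  # ['r', 'rs', 's']

# Factoring out the common prefix of the alternatives
GREETING = re.compile(r"Good (?:morning|evening)")
greeted = [bool(GREETING.fullmatch(s)) for s in ["Good morning", "Good evening", "Good night"]]
print(greeted)  # [True, True, False]

# Hypothetical, simplified email pattern (an assumption for illustration only)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def looks_like_email(candidate: str) -> bool:
    """Cheap spot-check first, full regex match only if it passes."""
    if "@" not in candidate:
        return False
    return EMAIL.fullmatch(candidate) is not None

print(looks_like_email("user@example.com"))        # True
print(looks_like_email("no-at-sign.example.com"))  # False
```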
  • Miscellaneous Language-Specific Notes -